Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

1.5 FASTQ READ QUALITY ASSESSMENT

Quality control for obtaining good sequencing data begins with the isolation of pure nucleic acid from the samples of interest. After isolation, the nucleic acid quality is usually assessed before sequencing. The nucleic acid must be pure and sufficiently concentrated, as recommended by the kit used for library preparation. If a NanoDrop spectrophotometer is used, the purity of the nucleic acid is measured by the ratio of the light absorbance at 260 nm to that at 280 nm (the 260/280 ratio). A ratio of ~1.8 is generally accepted as pure for DNA, and a ratio of ~2.0 is generally accepted as pure for RNA. Absorbance at 230 nm, in turn, is usually due to other contaminants. For pure nucleic acid, the 260/230 value is often higher than the respective 260/280 value. RNA quality is better assessed with the Bioanalyzer, which provides the RNA integrity number (RIN). The RIN ranges from 1 to 10, where 10 indicates the best RNA integrity, that is, intact (non-degraded) RNA. Most commercially available RNA library preparation kits require RNA with a RIN greater than 7 (RIN > 7) for proper library construction.
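The purity and integrity checks described above can be sketched in a few lines of Python. This is only an illustration: the function names, the `tolerance` of 0.2 around the target 260/280 ratio, and the absorbance readings in the usage line are assumptions made for the example, not values prescribed by any instrument or kit.

```python
def purity_ratios(a260, a280, a230):
    """Return the (260/280, 260/230) absorbance ratios."""
    return a260 / a280, a260 / a230


def assess_sample(kind, a260, a280, a230, rin=None, tolerance=0.2):
    """Collect warnings for a "DNA" or "RNA" sample.

    The tolerance is an illustrative assumption; follow the
    recommendations of your library preparation kit.
    """
    target = {"DNA": 1.8, "RNA": 2.0}[kind]
    r280, r230 = purity_ratios(a260, a280, a230)
    warnings = []
    # A 260/280 ratio far from ~1.8 (DNA) or ~2.0 (RNA) suggests impurity.
    if abs(r280 - target) > tolerance:
        warnings.append(f"260/280 = {r280:.2f}, expected ~{target}")
    # For pure nucleic acid, 260/230 is often higher than 260/280.
    if r230 < r280:
        warnings.append(f"260/230 = {r230:.2f} is below 260/280")
    # Most RNA library preparation kits ask for RIN > 7.
    if kind == "RNA" and rin is not None and rin <= 7:
        warnings.append(f"RIN = {rin} is not above 7")
    return warnings


# Made-up readings for an RNA sample with good ratios but a low RIN.
print(assess_sample("RNA", a260=1.0, a280=0.5, a230=0.45, rin=6.5))
# prints ['RIN = 6.5 is not above 7']
```

An empty list means that none of the three checks raised a concern for the sample.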

Another quality control checkpoint comes after library preparation and before sequencing. The quality of the nucleic acid libraries can also be assessed to ensure that no adaptor primers remain that may form adaptor dimers among the sequenced DNA fragments. Adaptor dimers are DNA molecules formed from complete adaptor sequences; they can bind and cluster on the flow cell and generate contaminating reads, which negatively impact the quality of the sequencing data.
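To make the idea of contaminating reads concrete, the sketch below flags reads that begin with adaptor sequence, which is what a read from an adaptor dimer tends to look like because there is little or no insert in front of the adaptor. The 13-base prefix used here is the one commonly associated with Illumina adaptors and is an assumption for this example, as are the function names, the `max_insert` cut-off, and the made-up reads; substitute the adaptor sequence documented for your own library kit.

```python
# Assumption: a 13-base prefix commonly associated with Illumina adaptors.
# Replace it with the adaptor sequence documented for your library kit.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def looks_like_adapter_dimer(read, adapter=ADAPTER_PREFIX, max_insert=10):
    """Flag a read whose adaptor starts within the first max_insert bases.

    A real library fragment carries an insert before the adaptor, so an
    adaptor at or very near the start suggests there was no real insert.
    """
    position = read.find(adapter)
    return 0 <= position <= max_insert


def dimer_fraction(reads):
    """Fraction of reads flagged as probable adaptor dimers."""
    reads = list(reads)
    if not reads:
        return 0.0
    flagged = sum(looks_like_adapter_dimer(r) for r in reads)
    return flagged / len(reads)


# Made-up reads: only the first one starts directly with the adaptor.
example_reads = [
    "AGATCGGAAGAGCACACGTCTGAACT",
    "ACGTACGTACGTACGTACGTACGTAGATCGGAAGAGC",
    "TTGACCAGTTACGGATCCAT",
]
print(f"{dimer_fraction(example_reads):.2f}")  # prints 0.33
```

Exact string matching is the simplest possible choice here; it does not tolerate sequencing errors within the adaptor, so it gives a rough estimate rather than a precise count.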

Quality assurance during nucleic acid isolation and the removal of adaptor primers after library preparation both help produce high-quality reads. However, errors can also arise during sequencing itself. In Illumina instruments, base-calling errors may result from a failure to terminate synthesis (the polymerase remains attached), which forms leading strands that are longer than normal, or from a failure to reverse the synthesis termination in the washing cycle, which forms lagging strands. The base-calling software converts fluorescence signals into actual sequence data with quality scores, which are stored in FASTQ files.
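As an illustration of how quality scores sit alongside the bases in a FASTQ file, the sketch below decodes the quality line of a single four-line record into Phred scores, where a score Q corresponds to an error probability of 10^(-Q/10). The function names and the example read are made up, and the Phred+33 offset is an assumption that matches current Illumina output but may not hold for older files.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 (Phred+33) is an assumption; older files may use
    a different offset.
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Convert a Phred score Q into the error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (read_id, sequence, scores)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, phred_scores(quality)


# Made-up example read: four high-quality bases, then a sharp drop.
record = ["@read1", "GATTACA", "+", "IIII5#!"]
read_id, sequence, scores = parse_fastq_record(record)
print(read_id, sequence, scores)
# prints read1 GATTACA [40, 40, 40, 40, 20, 2, 0]
```

A score of 40 therefore means roughly one expected error in 10,000 base calls, while a score of 20 means one in 100.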

A post-sequencing quality check must be carried out to assess read quality and to ensure that the sequence data has no problems or biases that could lead to inaccurate or misleading results. FastQC [10] is the most popular program for assessing the quality of sequencing data produced by high-throughput instruments. It provides a simple way to assess the quality of the raw data and generates a summary and simple graphical reports that give a clear idea of the overall quality of the raw data and of any potential quality problems. FastQC can be downloaded for all platforms from “https://www.bioinformatics.babraham.ac.uk/projects/fastqc/”. You can use the following steps to install the latest FastQC version at the time of writing on Linux (Ubuntu):

Use the “wget” command to download the current version; at the time of writing this is v0.11.9, which may change in the future.

wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip

Decompress the zipped file with unzip: